A Word Labeling Approach to Thai Sentence Boundary Detection and POS Tagging
Abstract
Previous studies on Thai Sentence Boundary Detection (SBD) mostly assumed that a sentence ends at a space and formulated SBD as a disambiguation problem, classifying each space as either a Sentence Boundary (SB) or a non-Sentence Boundary (nSB). In this paper, we propose a word labelling approach that treats the space character as a normal word and detects SB between any two words. This removes the restriction that SB can occur only at spaces and makes our system more robust for modern Thai writing, in which the space is not consistently used to indicate SB. As syntactic information contributes to better SBD, we further propose a joint Part-Of-Speech (POS) tagging and SBD framework based on the Factorial Conditional Random Field (FCRF) model. We compare the performance of our proposed approach with previously reported methods on the ORCHID corpus. We also ran experiments with the FCRF model on the TaLAPi corpus. The results show that the word labelling approach performs better than previous space-based classification approaches and that the FCRF joint model outperforms the LCRF model on SBD in all experiments.
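The word labelling idea described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the `<space>` placeholder token, the function names, and the choice to put the SB label on the last token of each sentence are all assumptions made for illustration, and the tokens `w1` to `w4` stand in for Thai words. The sketch only shows the data representation (one label per token, with the space treated as an ordinary token), not the CRF model itself.

```python
from typing import List, Tuple

# Hypothetical placeholder for the space character, treated as a normal token.
SPACE = "<space>"


def label_words(sentences: List[List[str]]) -> List[Tuple[str, str]]:
    """Flatten tokenized sentences into (token, label) pairs.

    Assumed scheme: the last token of each sentence is labelled "SB",
    every other token (including space tokens) is labelled "nSB".
    """
    labeled: List[Tuple[str, str]] = []
    for sent in sentences:
        for i, tok in enumerate(sent):
            label = "SB" if i == len(sent) - 1 else "nSB"
            labeled.append((tok, label))
    return labeled


def split_on_labels(labeled: List[Tuple[str, str]]) -> List[List[str]]:
    """Recover sentences from a labelled token stream."""
    sentences: List[List[str]] = []
    current: List[str] = []
    for tok, label in labeled:
        current.append(tok)
        if label == "SB":
            sentences.append(current)
            current = []
    if current:  # trailing tokens with no closing SB label
        sentences.append(current)
    return sentences


# Two sentences; the boundary between "w2" and "w3" is NOT at a space,
# and the space inside the first sentence is NOT a boundary.
example = [["w1", SPACE, "w2"], ["w3", "w4"]]
labeled = label_words(example)
recovered = split_on_labels(labeled)
```

In this representation a boundary can fall after any token, so a space-based classifier's two failure cases (a boundary with no space, a space with no boundary) are both expressible. A joint model would attach a second label layer, such as POS tags, to the same token sequence.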
Similar resources
Neural Network Approach to Thai Part Of Speech Tagging
Thai part of speech (POS) tagging is a challenging problem in natural language processing. Many techniques, including artificial neural network techniques, have been suggested for POS tagging. Research on Thai POS tagging has so far focused only on assigning word types, not word features. This paper proposes a technique using a multilayer perceptron for tagging word features in Thai sentences. The f...
The Automatic Thai Sentence Extraction
Unlike English, the Thai language has no explicit sentence marker. Conventionally, a space is placed at the end of a sentence in Thai writing, but this does not mean that a space always indicates a sentence boundary. It is also used for other purposes [Danvivathana 1987]. This paper presents an algorithm to extract sentences from a paragraph by detecting the true sentence-breaking spaces, by app...
An improved joint model: POS tagging and dependency parsing
Dependency parsing is a form of syntactic parsing that automatically analyzes the dependency structure of natural-language sentences, producing a dependency graph for each input sentence. Part-Of-Speech (POS) tagging is a prerequisite for dependency parsing. Generally, dependency parsers perform POS tagging and dependency parsing in a pipeline mode. Unfortunately, in pipel...
Building A Large Thai Text Corpus - Part-Of-Speech Tagged Corpus: ORCHID -
This paper presents a procedure for building a Thai part-of-speech (POS) tagged corpus named ORCHID. It is a collaborative project between the Communications Research Laboratory (CRL) of Japan and the National Electronics and Computer Technology Center (NECTEC) of Thailand. We proposed a new tagset based on previous research on Thai parts of speech for use in a multi-lingual machine translation pr...
ORCHID: Thai Part-Of-Speech Tagged Corpus
This paper presents a procedure for building a Thai part-of-speech (POS) tagged corpus named ORCHID [1]. It is a collaborative project between the Communications Research Laboratory (CRL) of Japan and the National Electronics and Computer Technology Center (NECTEC) of Thailand. We proposed a new tagset based on previous research on Thai parts of speech for use in a multi-lingual machine translatio...
Publication date: 2016